Information in this document is provided in
connection with Intel products. No license, express or
implied, by estoppel or otherwise, to any intellectual
property rights is granted by this document. Except as
provided in Intel's Terms and Conditions of Sale for such
products, Intel assumes no liability whatsoever, and
Intel disclaims any express or implied warranty, relating
to sale and/or use of Intel products including liability
or warranties relating to fitness for a particular
purpose, merchantability, or infringement of any patent,
copyright or other intellectual property right. Intel
products are not intended for use in medical, life
saving, or life sustaining applications. Intel may make
changes to specifications and product descriptions at any
time, without notice. Copyright (c) Intel Corporation 1998. Third-party brands and names are the property of their respective owners.
This application note describes optimization techniques that yield a substantial performance improvement in the Quantization step of JPEG compression on the Pentium II Processor and the Pentium Processor with MMX(TM) Technology. JPEG Quantization takes the 64 DCT frequency components of a block, divides each by a "quantizer step size", and rounds the results to integers to form the quantized coefficients. C code and MMX(TM) Technology assembly implementations are presented, and performance results for both are summarized. The code provided in this application note can be plugged directly into the IJG (Independent JPEG Group) royalty-free software with minimal code modifications to take advantage of MMX(TM) Technology. The modifications are listed in the Code Listing (Section 6.0).
JPEG uses the properties of the human eye to achieve 10 to 80 times compression. The JPEG baseline model consists of four stages: a transformation stage, a lossy quantization stage, and two lossless coding stages. The initial transform concentrates the information energy into the first few transform coefficients, the quantizer causes a controlled loss of information, and the two coding stages further compress the data. The YCbCr color space separates luminance (Y) from chrominance (Cb, Cr). The compression algorithm takes advantage of the fact that the human eye is more sensitive to luminance than to chrominance. The sensitivity of the eye also increases at low intensity levels. The individual color components in the YCbCr color space are less correlated than in the RGB color space. Therefore this model can be applied to compress each YCbCr component individually.
The Quantization step is used to reduce the magnitude of the DCT coefficients and to increase the number of zero-valued coefficients, based on the eye's ability to detect different levels at a given frequency. The quantization values are chosen to match the sensitivity of the eye: small values for low frequency coefficients and larger values for high frequency coefficients. The JPEG baseline model is considered a "lossy" compressor because the reconstructed image is not identical to the original. Lossless coders, which create images identical to the original, achieve lower compression ratios than JPEG.
In preparation for the transformation step, each color component of the image is broken up into blocks of 8X8 pixels. The video energy of an 8X8 block is scattered throughout its elements. If this video energy varies slowly across the image, a transform can concentrate it into a few coefficients, the 2-dimensional DCT coefficients. The uniform midstep quantizer is used for the JPEG baseline method, where the step size varies according to the coefficient location and the color component being encoded. Two separate Quantization tables are used, one each for luminance and chrominance. The equation for the quantizer can be written as:
Quantized Coefficients = DCT Frequency Coefficient / Quantizer
The decompression step uses the inverse quantizer:
DCT Frequency Coefficient = Quantized Coefficients * Quantizer
Quantization is the lossy stage in the JPEG coding scheme. If quantization is too coarse, images look "blocky", but if it is too fine, bits are wasted coding what is essentially noise. Quantization can be controlled by the Quality Factor, a number which changes the default quantization matrix by an effective multiplicative factor. A lower quality image gives better compression and vice versa.
Most JPEG compressors let you pick a file size versus image quality tradeoff by selecting a quality setting. For good quality full color source images, the default IJG quality setting (Q75) is very often the best choice. If the image was not high quality to begin with, dropping down to Q50 will not cause much degradation. Q95 is about the highest recommended quality. Q100 will generate a file 2 to 3 times larger than Q95 without much improvement. Images with sharp color edges may need a higher quality setting to avoid jagged edges.
The first step in optimizing was profiling with VTune. This step identified the functions to concentrate on for the purpose of optimization. Optimizing the most heavily used functions gives the best overall performance gain.
The original code takes the equation for the Quantizer and rounds to integers by adding half the denominator to the numerator and then performing an integer divide:
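The effect of the Quality Factor on a quantization table entry can be sketched as follows. The mapping is modeled on the scaling convention used by the IJG library (Q50 leaves the base table unchanged), reproduced here as an assumption rather than copied from the IJG source. The function names quality_to_scale and scale_quant_entry are hypothetical.

```c
#include <assert.h>

/* Map a quality setting (1..100) to a percentage scale factor.
 * Assumed IJG-style convention: Q50 gives 100% (base table unchanged),
 * lower qualities enlarge the step sizes, higher qualities shrink them. */
int quality_to_scale(int quality)
{
    if (quality < 1) quality = 1;
    if (quality > 100) quality = 100;
    return (quality < 50) ? 5000 / quality : 200 - 2 * quality;
}

/* Scale one base-table entry and clamp it to the 8-bit baseline range. */
int scale_quant_entry(int base, int scale_percent)
{
    long v;
    assert(base > 0 && scale_percent >= 0);
    v = ((long)base * scale_percent + 50L) / 100L;  /* rounded percentage */
    if (v < 1) v = 1;       /* a step size of 0 would mean divide-by-zero */
    if (v > 255) v = 255;   /* baseline tables hold 8-bit entries */
    return (int)v;
}
```

Under these assumptions, a base step size of 16 becomes 8 at Q75 and 32 at Q25, so a lower quality setting quantizes more coarsely.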
Quantized Coefficients = (DCT Frequency Coefficient + (Quantizer/2)) / Quantizer
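A minimal C sketch of this rounding divide, together with the inverse quantizer, is shown below. The function names quantize_round and dequantize are hypothetical, and the sign handling (round the magnitude, then restore the sign) is one straightforward reading of the formula rather than the exact IJG code.

```c
#include <assert.h>

/* Quantize one DCT coefficient: divide by the step size q and round to the
 * nearest integer by adding q/2 to the magnitude before the integer divide. */
int quantize_round(int coef, int q)
{
    int mag;
    assert(q > 0);
    mag = (coef < 0) ? -coef : coef;
    mag = (mag + (q >> 1)) / q;
    return (coef < 0) ? -mag : mag;
}

/* Inverse quantizer used during decompression. */
int dequantize(int qcoef, int q)
{
    return qcoef * q;
}
```

With q = 16, a coefficient of 100 quantizes to 6 and reconstructs to 96. The difference of 4 is the information lost, and it never exceeds q/2.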
There were two data dependent branches in the original C code. The first detected whether the denominator was greater than the numerator and, if so, set the result to zero. This "early out" mechanism avoids the slow divide. The second detected whether the DCT Frequency Coefficient was positive or negative so that it could be rounded accordingly: a negative rounder is added to a negative coefficient and a positive rounder to a positive coefficient. The Pentium II processor has a sophisticated branch prediction algorithm, but random sign changes in the DCT coefficients make these branches unpredictable. The misprediction penalty is 9 to 26 clock cycles on Pentium II Processors. The first branch was eliminated altogether by precalculating the ratios and the rounders and storing them in tables. The second branch was eliminated by using a simpler rounding method in which the rounder is the same whether the data is positive or negative, without introducing any detectable error.
Division is much slower than multiplication on the Pentium Processor and Pentium II Processors. Dividing by a constant is inefficient, and the reciprocal should be precalculated instead. The C code equation for generating quantized coefficients contains 64 divides by constants for every 8X8 pixel block in the picture. This is costly because divides are time consuming, data dependent, and not pipelined, so this is an area where JPEG Quantization performance can be increased. The divisor (Q) tables for luminance and chrominance are already set up. Two new tables containing (2^16)/Q values are precalculated; each entry is multiplied by the corresponding value of (F + Q/2), where F is the DCT frequency coefficient, and the product is shifted right 16 bits to compensate for the 2^16 scaling. Here is the complete picture:
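One way to remove the sign-dependent branch in scalar C is to build an all-ones or all-zeros mask from the sign and use it to negate conditionally, a scalar analogue of the mask that PCMPGTW produces for four words at once. This is an illustrative sketch, not the routine from the code listing, and the function names are hypothetical.

```c
#include <assert.h>

/* Reference version with a data-dependent branch on the sign. */
int quantize_branchy(int coef, int q)
{
    assert(q > 0);
    if (coef < 0)
        return -((-coef + (q >> 1)) / q);
    return (coef + (q >> 1)) / q;
}

/* Branch-free version: mask is -1 (all ones) for negative input, else 0.
 * (x ^ mask) - mask negates x when mask is -1 and leaves it alone when 0. */
int quantize_branchless(int coef, int q)
{
    int mask, mag;
    assert(q > 0);
    mask = -(coef < 0);
    mag = (coef ^ mask) - mask;     /* absolute value */
    mag = (mag + (q >> 1)) / q;     /* round to nearest */
    return (mag ^ mask) - mask;     /* restore the sign */
}
```

Both functions return the same value for every input; the second simply computes the sign fix-up arithmetically instead of through control flow.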
Quantized Coefficients = ((DCT Frequency Coefficient + (Quantizer/2)) * (2^16/Quantizer)) >> 16
Creation of the two new tables takes a one-time hit of 64 divides per table, but that is negligible compared to 64 divides per 8X8 pixel block in a picture. The multiplication by "(2^16/Quantizer)" and the shift right using ">>16" are accomplished together by the MMX(TM) Technology instruction PMULHW.
Unrolling the loop of the MMX Technology implementation from processing 4 words 16 times to processing 32 words twice gives a 25% performance increase. Beyond a certain point, however, the gains diminish while code size keeps growing, so further unrolling is not worthwhile.
The table below compares the cycle-count range of the C code implementation with the average of the MMX Technology implementation. For the picture used, the C code implementation clearly shows two peaks, reflecting the unpredictability of data dependent branching, while the MMX(TM) Technology implementation, which has no data dependent branch, shows a single peak. If the result of quantization and rounding is zero, the early out mechanism in the C code skips the divide, because divides are costly in processor cycles. The high performance algorithm removes both data dependent branches and replaces the divides with multiplies by a precalculated table.
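The divide-to-multiply conversion can be sketched in scalar C as follows. The reciprocal is computed in the rounded form (65536 + Q/2)/Q, and a 32-bit multiply followed by >> 16 stands in for PMULHW. The names make_reciprocal, quantize_mul, and quantize_div are hypothetical. Because the reciprocal is itself an integer approximation, the result can differ from the exact divide by one at some boundary values, which the sketch makes visible.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute round(2^16 / q) once per table entry. Assumes q >= 3 so the
 * result fits in a signed 16-bit word, as a PMULHW operand must. */
int16_t make_reciprocal(int q)
{
    assert(q >= 3 && q <= 32767);
    return (int16_t)((65536L + (q >> 1)) / q);
}

/* Quantize a non-negative value: (x + q/2) * recip, keeping the high
 * 16 bits of the 32-bit product. */
int quantize_mul(int x, int q, int16_t recip)
{
    int32_t n;
    assert(x >= 0 && x + (q >> 1) <= 32767);
    n = x + (q >> 1);
    return (int)((n * (int32_t)recip) >> 16);
}

/* Exact reference using an integer divide. */
int quantize_div(int x, int q)
{
    assert(q > 0);
    return (x + (q >> 1)) / q;
}
```

For q = 8 the reciprocal is exactly 8192 and both versions agree for every input. For q = 40 the reciprocal truncates to 1638, and x = 20 yields 0 from the multiply but 1 from the divide.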
Two deviations from the original IJG code are listed below:
| | Pentium II Processor, C Routine | Pentium II Processor, Optimized Routine | Pentium Processor with MMX(TM) Technology, C Routine | Pentium Processor with MMX(TM) Technology, Optimized Routine |
| Cycles, Range | 2529 - 9270 | 248 - 894 | 2809 - 13126 | 335 - 2610 |
| Cycles, Average | 3360 | 311 | 3930 | 406 |
| Performance Gain compared to C, Minimum | 1X | 8X | 1X | 7X |
| Performance Gain compared to C, Average | | 10.8X | | 9.6X |
| Overall improvement in JPEG compression | 1X | 3X | 1X | 3X |
NOTES:
1) MMX technology in-line assembly code is compiled using Microsoft Visual C++ 5 with the compiler options set to produce Pentium code and optimization set to maximum speed.
2) Performance gain compared to C implementation = Cycle Count of C routine / Average Cycle Count of MMX routine.
3) System configuration:
Pentium II Processor: 266MHz, 32MB memory, 11ms HD seek time
Pentium Processor with MMX(TM) Technology: 233MHz, 64MB memory, 11ms HD seek time
The following graphs plot the cycles required to perform quantization of 8X8 blocks versus the percentage of the total number of blocks (16K). The four cases compare the Pentium II and Pentium Processor with MMX(TM) Technology with C code and Optimized implementations on the same photographic image. Due to the zero-early-out algorithm used in C code implementation, the peaks of the C code implementation will vary from one image to another.
This application note shows a successful use of MMX(TM) Technology instructions and Pentium II Processor optimized code to implement a JPEG Quantization algorithm. The optimized implementation demonstrated an approximately 10X performance gain when compared to the original C implementation. The gain can be attributed to the substitution of multiplies for divides, the removal of data dependent branches, and SIMD instructions that multiply four values in parallel in a single instruction.
By taking advantage of Intel's CPUID instruction, software developers can create software applications and tools that execute compatibly across the widest range of Intel processor generations and models, past, present, and future. If, after running the CPUID instruction, it is determined that the machine is capable of running MMX(TM) Technology instructions, a global switch needs to be turned "on" to take advantage of the MMX(TM) Technology code. If CPUID reports a processor without MMX(TM) Technology, the scalar C code can be used as the default path, maintaining compatibility with all previous generations of Intel processors.
jcdctmgr.c is the modified file. The file contains both the C code and MMX(TM) Technology implementations. To choose the MMX(TM) Technology code, turn "on" the "MMXAvailable" switch. For a detailed description of using the CPUID instruction, see Intel Application Note AP-485. Visit http://developer.intel.com/design/perftool/cpuid/ for source and DLL.
To run cjpeg.exe, you need to download the IJG software from one of the following sites. Free, portable C code for JPEG compression is available from the Independent JPEG Group. Source code, documentation, and test files are included. Version 6a is available from ftp.uu.net:/graphics/jpeg/jpegsrc.v6a.tar.gz. If you are on a PC you may prefer ZIP archive format, which you can find at ftp.simtel.net:/pub/simtelnet/msdos/graphics/jpegsr6a.zip (or at any Simtel mirror site). On CompuServe, see the Graphics Support forum (GO CIS:GRAPHSUP), library 12 "JPEG Tools", file jpegsr6a.zip. IJG requires all users of their code to read the readme.txt file.
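A minimal sketch of setting such a switch is shown below. It assumes a GCC- or Clang-compatible compiler, whose <cpuid.h> header provides __get_cpuid(); the Visual C++ toolchain used in this note would need its own form of CPUID access instead. MMX(TM) Technology support is reported in bit 23 of EDX for CPUID function 1. The helper names edx_reports_mmx, detect_mmx, and init_mmx_switch are hypothetical, and on non-x86 targets the sketch simply reports that MMX is unavailable.

```c
#include <assert.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_GNU_CPUID 1
#else
#define HAVE_GNU_CPUID 0
#endif

/* Test the MMX feature flag (bit 23) in the EDX value from CPUID function 1. */
int edx_reports_mmx(unsigned int edx)
{
    return (int)((edx >> 23) & 1u);
}

/* Return 1 if the processor reports MMX support, 0 otherwise. */
int detect_mmx(void)
{
#if HAVE_GNU_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID function 1 not supported */
    return edx_reports_mmx(edx);
#else
    return 0;                     /* fall back to the scalar C path */
#endif
}

/* Global switch in the spirit of the note's MMXAvailable flag. */
int MMXAvailable = 0;

/* Call once at start-up, before any quantization is done. */
void init_mmx_switch(void)
{
    MMXAvailable = detect_mmx();
    assert(MMXAvailable == 0 || MMXAvailable == 1);
}
```

Keeping the feature-bit test separate from the CPUID call makes the bit logic checkable on any machine, while the switch itself defaults to the scalar path whenever detection is not possible.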
Modifications required to run this code:
#include <stdio.h>
/*
 * jcdctmgr.c
 *
 * Copyright (C) 1994-1996, Thomas G. Lane.
 * This file is part of the Independent JPEG Group's software.
 * For conditions of distribution and use, see the accompanying README file.
 *
 * This file contains the forward-DCT management logic.
 * This code selects a particular DCT implementation to be used,
 * and it performs related housekeeping chores including coefficient
 * quantization.
 */

#define JPEG_INTERNALS
#include "jinclude.h"
#include "jpeglib.h"
#include "jdct.h"               /* Private declarations for DCT subsystem */


/* Private subobject for this module */

typedef struct {
  struct jpeg_forward_dct pub;  /* public fields */

  /* Pointer to the DCT routine actually in use */
  forward_DCT_method_ptr do_dct;

  /* The actual post-DCT divisors --- not identical to the quant table
   * entries, because of scaling (especially for an unnormalized DCT).
   * Each table is given in normal array order.
   */
  DCTELEM * divisors[NUM_QUANT_TBLS];

#ifdef DCT_FLOAT_SUPPORTED
  /* Same as above for the floating-point case. */
  float_DCT_method_ptr do_float_dct;
  FAST_FLOAT * float_divisors[NUM_QUANT_TBLS];
#endif
} my_fdct_controller;

typedef my_fdct_controller * my_fdct_ptr;


/*
 * Initialize for a processing pass.
 * Verify that all referenced Q-tables are present, and set up
 * the divisor table for each one.
 * In the current implementation, DCT of all components is done during
 * the first pass, even if only some components will be output in the
 * first scan.  Hence all components should be examined here.
 */

/* NEW CODE ADDED
 * mmx_rounders is an array of two 64 entries corresponding to the
 * two Quantization tables - Luminance and Chrominance.
 */
DCTELEM mmx_rounders[NUM_QUANT_TBLS][DCTSIZE2];

METHODDEF(void)
start_pass_fdctmgr (j_compress_ptr cinfo)
{
  my_fdct_ptr fdct = (my_fdct_ptr) cinfo->fdct;
  int ci, qtblno, i;
  jpeg_component_info *compptr;
  JQUANT_TBL * qtbl;
  DCTELEM * dtbl;

  for (ci = 0, compptr = cinfo->comp_info; ci < cinfo->num_components;
       ci++, compptr++) {
    qtblno = compptr->quant_tbl_no;
    /* Make sure specified quantization table is present */
    if (qtblno < 0 || qtblno >= NUM_QUANT_TBLS ||
        cinfo->quant_tbl_ptrs[qtblno] == NULL)
      ERREXIT1(cinfo, JERR_NO_QUANT_TABLE, qtblno);
    qtbl = cinfo->quant_tbl_ptrs[qtblno];
    /* Compute divisors for this quant table */
    /* We may do this more than once for same table, but it's not a big deal */
    switch (cinfo->dct_method) {
#ifdef DCT_ISLOW_SUPPORTED
    case JDCT_ISLOW:
      /* For LL&M IDCT method, divisors are equal to raw quantization
       * coefficients multiplied by 8 (to counteract scaling).
       */
      if (fdct->divisors[qtblno] == NULL) {
        fdct->divisors[qtblno] = (DCTELEM *)
          (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
                                      DCTSIZE2 * SIZEOF(DCTELEM));
      }
      dtbl = fdct->divisors[qtblno];
      for (i = 0; i < DCTSIZE2; i++) {
        dtbl[i] = ((DCTELEM) qtbl->quantval[i]) << 3;
      }
      break;
#endif
#ifdef DCT_IFAST_SUPPORTED
    case JDCT_IFAST:
      {
        /* For AA&N IDCT method, divisors are equal to quantization
         * coefficients scaled by scalefactor[row]*scalefactor[col], where
         *   scalefactor[0] = 1
         *   scalefactor[k] = cos(k*PI/16) * sqrt(2)    for k=1..7
         * We apply a further scale factor of 8.
         */
#define CONST_BITS 14
        static const INT16 aanscales[DCTSIZE2] = {
          /* precomputed values scaled up by 14 bits */
          16384, 22725, 21407, 19266, 16384, 12873,  8867,  4520,
          22725, 31521, 29692, 26722, 22725, 17855, 12299,  6270,
          21407, 29692, 27969, 25172, 21407, 16819, 11585,  5906,
          19266, 26722, 25172, 22654, 19266, 15137, 10426,  5315,
          16384, 22725, 21407, 19266, 16384, 12873,  8867,  4520,
          12873, 17855, 16819, 15137, 12873, 10114,  6967,  3552,
           8867, 12299, 11585, 10426,  8867,  6967,  4799,  2446,
           4520,  6270,  5906,  5315,  4520,  3552,  2446,  1247
        };
        SHIFT_TEMPS

        if (fdct->divisors[qtblno] == NULL) {
          fdct->divisors[qtblno] = (DCTELEM *)
            (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
                                        DCTSIZE2 * SIZEOF(DCTELEM));
        }
        dtbl = fdct->divisors[qtblno];
        for (i = 0; i < DCTSIZE2; i++) {
          dtbl[i] = (DCTELEM)
            DESCALE(MULTIPLY16V16((INT32) qtbl->quantval[i],
                                  (INT32) aanscales[i]),
                    CONST_BITS-3);
        }
      }
      break;
#endif
#ifdef DCT_FLOAT_SUPPORTED
    case JDCT_FLOAT:
      {
        /* For float AA&N IDCT method, divisors are equal to quantization
         * coefficients scaled by scalefactor[row]*scalefactor[col], where
         *   scalefactor[0] = 1
         *   scalefactor[k] = cos(k*PI/16) * sqrt(2)    for k=1..7
         * We apply a further scale factor of 8.
         * What's actually stored is 1/divisor so that the inner loop can
         * use a multiplication rather than a division.
         */
        FAST_FLOAT * fdtbl;
        int row, col;
        static const double aanscalefactor[DCTSIZE] = {
          1.0, 1.387039845, 1.306562965, 1.175875602,
          1.0, 0.785694958, 0.541196100, 0.275899379
        };

        if (fdct->float_divisors[qtblno] == NULL) {
          fdct->float_divisors[qtblno] = (FAST_FLOAT *)
            (*cinfo->mem->alloc_small) ((j_common_ptr) cinfo, JPOOL_IMAGE,
                                        DCTSIZE2 * SIZEOF(FAST_FLOAT));
        }
        fdtbl = fdct->float_divisors[qtblno];
        i = 0;
        for (row = 0; row < DCTSIZE; row++) {
          for (col = 0; col < DCTSIZE; col++) {
            fdtbl[i] = (FAST_FLOAT)
              (1.0 / (((double) qtbl->quantval[i] *
                       aanscalefactor[row] * aanscalefactor[col] * 8.0)));
            i++;
          }
        }
      }
      break;
#endif
    default:
      ERREXIT(cinfo, JERR_NOT_COMPILED);
      break;
    } //end of case

    /* NEW CODE ADDED
     * If an MMX machine is detected:
     * "mmx_rounders" is used to round the Quantized values to integers.
     * It is an array of two 64 entries corresponding to the two
     * Quantization tables - Luminance and Chrominance.
     * The idea is to store the rounding values (half the Quantization
     * table value) in this array and add them to the DCT frequency
     * components later.
     * dtbl[] is overwritten with (2^16)/dtbl[]. This quantity can be
     * multiplied to the sum of DCT frequency components and their
     * rounding factors. The "divides" can thus be converted into
     * "multiplies" speeding up the process significantly.
     */
    if (MMXAvailable) {
      for (i=0; i<DCTSIZE2; i++) {
        mmx_rounders[qtblno][i]=dtbl[i]>>1;
        dtbl[i]=( 65536 + (dtbl[i]>>1))/dtbl[i]; //16bits
      }
    }
  } //end of loop
} //end of func


/*
 * Perform forward DCT on one or more blocks of a component.
 *
 * The input samples are taken from the sample_data[] array starting at
 * position start_row/start_col, and moving to the right for any additional
 * blocks. The quantized coefficients are returned in coef_blocks[].
 */

METHODDEF(void)
forward_DCT (j_compress_ptr cinfo, jpeg_component_info * compptr,
             JSAMPARRAY sample_data, JBLOCKROW coef_blocks,
             JDIMENSION start_row, JDIMENSION start_col,
             JDIMENSION num_blocks)
/* This version is used for integer DCT implementations.
 */
{
  /* This routine is heavily used, so it's worth coding it tightly. */
  my_fdct_ptr fdct = (my_fdct_ptr) cinfo->fdct;
  forward_DCT_method_ptr do_dct = fdct->do_dct;
  DCTELEM * divisors = fdct->divisors[compptr->quant_tbl_no];
  DCTELEM workspace[DCTSIZE2];  /* work area for FDCT subroutine */
  JDIMENSION bi;
  DCTELEM *workspaceptr = workspace;
  JCOEFPTR output;
  int i,j;

  sample_data += start_row;     /* fold in the vertical offset once */

  for (bi = 0; bi < num_blocks; bi++, start_col += DCTSIZE) {
    /* Load data into workspace, applying unsigned->signed conversion */
    { register DCTELEM *workspaceptr;
      register JSAMPROW elemptr;
      register int elemr;

      workspaceptr = workspace;
      if (!MMXAvailable) {
        for (elemr = 0; elemr < DCTSIZE; elemr++) {
          elemptr = sample_data[elemr] + start_col;
          //printf("%i \n", sizeof(*workspaceptr));
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
          *workspaceptr++ = GETJSAMPLE(*elemptr++) - CENTERJSAMPLE;
        }
      }
      else //if (MMXAvailable)
      {
        __int64 centersamp64=0x0080008000800080;
        __asm {
          mov eax, workspaceptr
          mov ebx, sample_data
          mov edx, start_col
          pxor mm7,mm7

          mov ecx, [ebx+0]
          add ecx, edx          //sample_data[0]+start_col
          movq mm6,centersamp64
          movq mm0,[ecx]
          movq mm1,mm0
          punpcklbw mm0,mm7
          punpckhbw mm1,mm7
          psubw mm0,mm6
          psubw mm1,mm6
          movq [eax],mm0
          movq [eax+8],mm1

          mov ecx, [ebx+1*4]
          add ecx, edx          //sample_data[1]+start_col
          movq mm2,[ecx]
          movq mm3,mm2
          punpcklbw mm2,mm7
          punpckhbw mm3,mm7
          psubw mm2,mm6
          psubw mm3,mm6
          movq [eax+16],mm2
          movq [eax+24],mm3

          mov ecx, [ebx+2*4]
          add ecx, edx          //sample_data[2]+start_col
          movq mm4,[ecx]
          movq mm5,mm4
          punpcklbw mm4,mm7
          punpckhbw mm5,mm7
          psubw mm4,mm6
          psubw mm5,mm6
          movq [eax+32],mm4
          movq [eax+40],mm5

          mov ecx, [ebx+3*4]
          add ecx, edx          //sample_data[3]+start_col
          movq mm0,[ecx]
          movq mm1,mm0
          punpcklbw mm0,mm7
          punpckhbw mm1,mm7
          psubw mm0,mm6
          psubw mm1,mm6
          movq [eax+48],mm0
          movq [eax+56],mm1

          mov ecx, [ebx+4*4]
          add ecx, edx          //sample_data[4]+start_col
          movq mm2,[ecx]
          movq mm3,mm2
          punpcklbw mm2,mm7
          punpckhbw mm3,mm7
          psubw mm2,mm6
          psubw mm3,mm6
          movq [eax+64],mm2
          movq [eax+72],mm3

          mov ecx, [ebx+5*4]
          add ecx, edx          //sample_data[5]+start_col
          movq mm4,[ecx]
          movq mm5,mm4
          punpcklbw mm4,mm7
          punpckhbw mm5,mm7
          psubw mm4,mm6
          psubw mm5,mm6
          movq [eax+80],mm4
          movq [eax+88],mm5

          mov ecx, [ebx+6*4]
          add ecx, edx          //sample_data[6]+start_col
          movq mm0,[ecx]
          movq mm1,mm0
          punpcklbw mm0,mm7
          punpckhbw mm1,mm7
          psubw mm0,mm6
          psubw mm1,mm6
          movq [eax+96],mm0
          movq [eax+104],mm1

          mov ecx, [ebx+7*4]
          add ecx, edx          //sample_data[7]+start_col
          movq mm2,[ecx]
          movq mm3,mm2
          punpcklbw mm2,mm7
          punpckhbw mm3,mm7
          psubw mm2,mm6
          psubw mm3,mm6
          movq [eax+112],mm2
          movq [eax+120],mm3
          // emms               //done later, after quant
        }
      }
    }

    /* Perform the DCT */
    (*do_dct) (workspace);

    if (MMXAvailable) {
      JCOEFPTR output_ptr = coef_blocks[bi];
      __int64 pos_one =0x0001000100010001;
      __int64 neg_one =0xffffffffffffffff;
      //loading the address of mmx_rounders.
      DCTELEM *round_tbl = mmx_rounders[compptr->quant_tbl_no];
      output=output_ptr;

      /* NEW CODE
       * Quantize/descale the coefficients, and store into coef_blocks[].
       * The original C-code took the DCT frequency coefficients and
       * divided them by the values of the quant tables and rounded them
       * to integers. Since divides are inefficient, they were converted
       * to multiplies using the following technique:
       *   (DCT+QuantVal/2)/QuantVal =
       *       ( (DCT+QuantVal/2) * (2^16/QuantVal) ) >> 16
       * QuantVal/2 is precalculated and stored in mmx_rounders.
       * (2^16/QuantVal) is precalculated and stored in divisors.
       * The old divisors are now multipliers.
       * Finally, the pmulhw performs the "multiply" and the implied
       * ">> 16" in one operation causing the performance gain.
       * Also eliminated the branch for negative DCT frequency
       * Coefficients.
       */
      __asm {
        xor ebx, ebx            //zero the count
        mov esi, workspaceptr   //load data
        mov ecx, divisors       //load quantization multipliers
        mov edx, round_tbl      //load rounding table
        mov edi, output_ptr     //load storage
        movq mm4, neg_one       //FFFF FFFF FFFF FFFF
        movq mm5, pos_one       //0001 0001 0001 0001
      quant_loop:
        movq mm0,[esi+ebx]      //load data
        movq mm3, mm0           //save copy of data
        pxor mm7, mm7           //clear mm7
        movq mm1,[edx+ebx]      //load rounder
        movq mm2, [ecx+ebx]     //load quantization multipliers
        pcmpgtw mm7, mm3        //generate mask
                                //negative words = FFFF
                                //positive words = 0000
        pxor mm0, mm4           //1's complement all
        paddw mm0, mm5          //2's complement to flip the sign
        paddw mm3, mm1          //add rounding factor to pos#
        paddw mm0, mm1          //add rounding factor to neg#
        pmulhw mm3, mm2         //multiply pos# and shift right 16bits
        pmulhw mm0, mm2         //multiply neg# and shift right 16bits
        pxor mm0, mm4           //1's complement all
        paddw mm0, mm5          //2's complement to flip the sign
        pand mm0, mm7           //mask and save the neg
        pandn mm7, mm3          //mask and save the pos
        por mm0, mm7            //combine pos and neg in mm0
        movq [edi+ebx],mm0      //store data
        add ebx,8               //add 8 bytes (4 words)
        cmp ebx,128             //done yet (64 words*2bytes=128)
        jne quant_loop
        emms
      }
    }
    else { //not MMX
#ifdef FAST_DIVIDE
#define DIVIDE_BY(a,b)  a /= b
#else
#define DIVIDE_BY(a,b)  if (a >= b) a /= b; else a = 0
#endif
      { // register DCTELEM temp, qval;
        DCTELEM temp, qval;
        register int i;
        register JCOEFPTR output_ptr = coef_blocks[bi];
        output=output_ptr;
        for (i = 0; i < DCTSIZE2; i++) {
          qval = divisors[i];
          temp = workspace[i];
          if (temp < 0) {
            temp = -temp;
            temp += qval>>1;    // for rounding
            DIVIDE_BY(temp, qval);
            temp = -temp;
          } else {
            temp += qval>>1;    // for rounding
            DIVIDE_BY(temp, qval);
          }
          output_ptr[i] = (JCOEF) temp;
        }
      }
    } //end of not MMX
  }
}

Code Listing 1: C and MMX(TM) Technology Implementation of JPEG Quantization of DCT coefficients.